Signal modeling enhancements for automatic speech recognition
Abstract
Obtaining a compact, information-rich representation of the speech signal is an important first step in ASR. A large majority of ASR systems use some form of cepstral coefficients for this purpose. Computation of these cepstral coefficients typically includes several of the following steps: (1) high-frequency pre-emphasis, using an FIR filter of the form y(k) = x(k) - a x(k-1), with a taking values around 0.95; (2) partition of the signal into analysis frames of 20 to 30 ms, spaced 5 to 10 ms apart; (3) computation of ten to forty cepstral coefficients using a cosine transform of the logarithm of the output of a 40-channel triangular filter bank, which is designed to approximate a Bark frequency scale; and (4) assembly of feature vectors from the instantaneous cepstral values, augmented with some form of dynamic information, e.g. delta-cepstra. This paper describes several enhancements to this procedure. We show that significant improvements in recognition accuracy can be achieved by modifications in all of these steps, particularly for speech corrupted by noise. In particular, we show that 1. The first-order high-frequency pre-emphasis should be replaced by a second-order pre-emphasis of the form:
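As an illustration of the conventional procedure in steps (1) to (4) of the abstract, and not of the proposed enhancements, the following is a minimal NumPy sketch. Everything not stated in the abstract is an assumption made for the example: the function names, the Hamming window, the 25 ms frame and 10 ms hop, the default of 13 cepstra, the Bark approximation z = 6 asinh(f / 600), and the use of a simple time gradient as the delta-cepstra.

```python
import numpy as np

def preemphasize(x, a=0.95):
    """Step (1): y(k) = x(k) - a*x(k-1); the first sample is passed through."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - a * x[:-1]))

def frame_signal(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Step (2): overlapping analysis frames (Hamming window is an assumption)."""
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(round(fs * hop_ms / 1000))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def bark_filterbank(n_filters, n_fft, fs):
    """Triangular filters with edges equally spaced on an assumed Bark scale."""
    top_bark = 6.0 * np.arcsinh((fs / 2) / 600.0)
    edges_hz = 600.0 * np.sinh(np.linspace(0.0, top_bark, n_filters + 2) / 6.0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def cepstral_features(x, fs, n_ceps=13, n_filters=40, a=0.95):
    """Steps (1)-(4): returns one row per frame, [cepstra, delta-cepstra]."""
    frames = frame_signal(preemphasize(x, a), fs)
    n_fft = 1 << (frames.shape[1] - 1).bit_length()  # next power of two
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    fb = bark_filterbank(n_filters, n_fft, fs)
    log_energy = np.log(power @ fb.T + 1e-10)  # small floor avoids log(0)
    # Step (3): cosine transform (unnormalized DCT-II) of the log energies.
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (k + 0.5) / n_filters)
    ceps = log_energy @ dct.T
    # Step (4): append dynamic information; a time gradient is one simple choice.
    delta = np.gradient(ceps, axis=0)
    return np.hstack([ceps, delta])
```

For a one-second signal sampled at 16 kHz, `cepstral_features(x, 16000)` returns an array with one row per 10 ms frame and 26 columns (13 cepstra followed by 13 deltas).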
Similar resources
A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation
Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by making man-machine interaction more natural. Since speech is the most popular method of communication, recognizing human emotions from the speech signal has become a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...
Improving the performance of a continuous speech recognition system using features extracted from speech manifolds in the reconstructed phase space
Designing new feature extraction methods for the speech signal and combining the information they provide is one of the most effective approaches to improving the performance of automatic speech recognition (ASR) systems. Recent research has shown that the speech signal contains nonlinear and chaotic properties, but the effects of these properties are not used in the continuous ...
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Improving the performance of MFCC for Persian robust speech recognition
Mel-frequency cepstral coefficients (MFCCs) are the most widely used features in speech recognition, but they are very sensitive to noise. In this paper, to achieve satisfactory performance in Automatic Speech Recognition (ASR) applications, we introduce a new noise-robust set of MFCC vectors estimated through the following steps. First, spectral mean normalization is a pre-processing which applies to t...
Convolutional neural network with adaptive windows for speech recognition
Although speech recognition systems are widely used and their accuracy is continuously increasing, there is a considerable gap between their accuracy and human recognition ability. This is partially due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...